
================== Classication with Krimp ======================


=== 0. Setup & get to know Slim ===

- Before you continue, it is probably a good idea to get to know the 'basic' functionality
of Slim first. For this, follow the instructions in 'My First Slim Experiment.txt'.
- Don't forget to prepare the database(s) you want to use for classification, as described 
in that document.


=== 1. Prepare your database for classification ===

- After preparing your database(s) for Slim, you may wish to add additional info on the
items that define the class label a transaction has.
- Please note: Slim currently only supports databases in which each transaction has a
single class label. Each possible label should be represented by a single item. I.e. for
binary classification, two items should be present in the original database that are used
to label each transaction.
E.g., with items 0 and 1 as labels, example transactions could look like:
0 3 5 8 9 10
1 3 4 8 10
1 2 6 7 9 11 12 13
The first transaction belongs to class 0, the other two to class 1.
- The 'class definition' tells Slim which items to regard as class labels. If a database
has a fixed class definition, it is probably good to add this info to your converted .db
database file.
- Note that item numbers are probably changed when converting your original database to
Slim format. To find out the correction translation of the item numbers, use analysedb 
(see other document, example given as chess.db.analysis.txt).
- A class definition is given by adding a line to the .db file using a plain text editor.
So, open up the file and add a line to the 'header' (which consists of a series of lines 
starting with 2 characters and a semicolon, like ab: and it:). The class definition should
look like:
cl: 0 1
In this case, items 0 and 1 will be considered class labels. More class labels are no
problem, for example:
cl: 3 6 7 9
- (A class definition in a database can be overridden for particular experiments if 
desired, see later on.)


=== 2. Partitioning databases for cross-validation and compressing them ===

- Open up classifycompress.conf.
- The most important configuration directives are:
	* iscName = chess-all-2500d
	This directive tells Slim which candidates are to be used. Also, the database name is 
	extracted from this directive is not given separately (which is the easiest way to go).

	Format: [dbName]-[itemsetType]-[minSup][candidateOrder]
	
	dbName			- wine, chess, mushroom, ...
	itemsetType		- all, closed
	minSup			- absolute minimum support level
	candidateOrder	- the default Slim order as described in our papers is 'd'
	
	Note that lower minsups are often possible with classification than with a regular
	Slim run, as cross-validation is used and the database is split up per class.
	
	* seed = 0
	The number given here is used to seed the random number generator. When set to 0, the
	current time will be used. 
	The random number generator is used to randomize the partitions for cross-validation.
	
	* thresholdBits = 0
	The number given here is used to stop compression earlier. Whenever the gain in length
	of the database is less than thresholdBits, compression is stopped. This helps to 
	achieve better and faster classification as described in our poster "Slimmer, outsmarting Slim".

	* #classDefinition = 9 14 22
	If no class definition is given in the database or that one has to be overridden (e.g.,
	a different target is chosen for the current experiment), uncomment and use this 
	directive. Not used by default.

- Run Slim with this config file. This may take a while, as a lot of compression runs
are to be done. Start with a small database to find out how things work.
- Resulting databases and compression results are stored in Slim\xps\classify. Each run
gets its own directory, based on configuration and a timestamp.


=== 3. The real thing: classification ===

- Go to the directory created for this experiment, as just mentioned.
- If everything went well, a new configuration file called 'classify.conf' has been 
created in this directory. To do the actual classification, start Slim with this file.
(This can be done in several ways, the easiest ways probably being 1) copying the file
to the directory where Slim.exe resides or 2) linking .conf-files to automatically run
with Slim.exe so that double-clicking the file is enough.)
- Simply run this config file to do the actual classification.
- By default, two types of classification are done. In the configuration files, these 
two types are called 'code table matchings', but in the original paper we refer to
these as 'absolute pairing' and 'relative pairing'. See the paper for more details.
(Please note that for relative pairing, the reportSup in classifycompress.conf has to
be low enough. During the Slim process, a code table is written to disk every 
'reportSup' decrease in support. For small datasets, a reportSup of 10 would be 
appropiate, whereas this would results in many files written with larger databases. 
Therefore, the default for this directive is 50.)


=== 4. Inspecting your results ===

- All results are stored in the directory created for this particular experiment.
- In a result directory, you'll find 5 files containing results.
	* confusion-[abs|rel].txt
		A confusion matrix for the last code table pairing (lowest minsup/percentage),
		of both absolute (abs) and relative (rel) pairing.
	* classify-[abs|rel].csv
		Detailed results for both relative and absolute pairings, per fold and overall.
		Each row represents a combination(/pairing/matching) of code tables that are
		used for classification. In the absolute case, the minsup is given in the left
		column, in the relative case, the value represents a relative minsup. (Depending
		on the reportSup, not all code tables are available -- for each class, the 
		absolute minsup that is closest to the desired relative minsup is used.)
		On the right, aggregated statistics are shown.
	* summary.csv
		Gives a summary of all results. It gives the best obtained result (highest 
		average accuracy over all folds) and the statistics of the last classification
		(lowest minsup).
		For easy aggregation into a larger file containing more results, all info is 
		also given in a single line.

References:
-	K. Smets, J. Vreeken. Slim: Mining Itemsets that Compress. 
	In: Proceedings of the SIAM International Conference on Data Mining (SDM'12), pp 236-247, SIAM, 2012.
	http://www.springerlink.com/content/g436881660846127/ (open access)
- 	M. Gandhi, J. Vreeken. Slimmer, outsmarting Slim.
	https://people.mmci.uni-saarland.de/~jilles/pres/ida14-slimmer-poster.pdf

Contact:
-	Koen Smets			@ koen.smets@gmail.com
- 	Jilles Vreeken 		@ jilles@mpi-inf.mpg.de
- 	Manan Gandhi		@ manangandhi7@gmail.com